Abstract

The conventional model for online planning under uncertainty assumes that an agent can stop and plan without incurring costs for the time spent planning. However, planning time is not free in most real-world settings. For example, an autonomous drone is subject to nature’s forces, like gravity, even while it thinks, and must either pay a price for counteracting these forces to stay in place, or grapple with the state change caused by acquiescing to them. Policy optimization in these settings requires metareasoning, a process that trades off the cost of planning against the potential policy improvement that can be achieved. We formalize and analyze the metareasoning problem for Markov Decision Processes (MDPs). Our work subsumes previously studied special cases of metareasoning and shows that in the general case, metareasoning is at most polynomially harder than solving MDPs with any given algorithm that disregards the cost of thinking. For reasons we discuss, optimal general metareasoning turns out to be impractical, motivating approximations. We present approximate metareasoning procedures that rely on special properties of the BRTDP planning algorithm and explore the effectiveness of our methods on a variety of problems.


1 Introduction 

Offline probabilistic planning approaches, such as policy iteration [Howard, 1960], aim to construct a policy for every possible state before acting. In contrast, online planners, such as RTDP [Barto et al., 1995] and UCT [Kocsis and Szepesvári, 2006], interleave planning with execution. After an agent takes an action and moves to a new state, these planners suspend execution to plan for the next step. The more planning time they have, the better their action choices. Unfortunately, planning time in online settings is usually not free.

Consider an autonomous Mars rover trying to decide what to do while a sandstorm is nearing. The size and uncertainty of the domain preclude a priori computation of a complete policy, and demand the use of online planning algorithms. Normally, the longer the rover runs its planning algorithm, the better the decision it can make. However, computation costs power; moreover, if it reasons for too long without taking preventive action, it risks being damaged by the oncoming sandstorm. Or consider a space probe on final approach to a speeding comet, when the probe must plan to ensure a safe landing based on new information it gets about the comet’s surface. More deliberation time means a safer landing. At the same time, if the probe deliberates for too long, the comet may zoom out of range, a similarly undesirable outcome.

Scenarios like these give rise to a general metareasoning decision problem: how should an agent trade off the cost of planning and the quality of the resulting policy for the base planning task every time it needs to make a move, so as to optimize its long-term utility? Metareasoning about base-level problem solving has been explored for probabilistic inference and decision making [Horvitz, 1987; Horvitz et al., 1989], theorem proving [Horvitz and Klein, 1995; Kautz et al., 2002], handling streams of problems [Horvitz, 2001; Shahaf and Horvitz, 2009], and search [Russell and Wefald, 1991; Burns et al., 2013]. There has been little work exploring generalized approaches to metareasoning for planning.

We explore the general metareasoning problem for Markov decision processes (MDPs). We begin by formalizing the problem with a general but precise definition that subsumes several previously considered metareasoning models. Then, we show with a rigorous theoretical analysis that optimal general metareasoning for planning under uncertainty is at most polynomially harder than solving the original planning problem with any given MDP solver. However, this increase in computational complexity, among other reasons we discuss, renders such optimal general metareasoning impractical. The analysis raises the issue of allocating time for metareasoning itself, and leads to an infinite regress of meta*reasoning (metareasoning, metametareasoning, etc.) problems.

We next turn to the development and testing of fast approximate metareasoning algorithms. Our procedures use the Bounded RTDP (BRTDP) algorithm [McMahan et al., 2005] to tackle the base MDP problem, and leverage BRTDP-computed bounds on the quality of MDP policies to reason about the value of computation. In contrast to prior work on this topic, our methods do not require any training data, precomputation, or prior information about target domains. We perform a set of experiments showing the performance of these algorithms versus baselines in several synthetic domains with different properties, and characterize their performance with a measure that we call the metareasoning gap, a measure of the potential for improvement from metareasoning. The experiments demonstrate that the proposed techniques excel when the metareasoning gap is large.


2 Related Work 

Metareasoning efforts to date have employed strategies that avoid the complexity of the general metareasoning problem for planning by relying on different kinds of simplifications and approximations. Such prior studies include metareasoning for time-critical decisions where expected value of computation is used to guide probabilistic inference [Horvitz, 1987; Horvitz et al., 1989], and work on the guiding of sequences of single actions in search [Russell and Wefald, 1991; Burns et al., 2013]. Several lines of work have leveraged offline learning [Breese and Horvitz, 1990; Horvitz et al., 2001; Kautz et al., 2002]. Other studies have relied on optimizations and inferences that leverage the structure of problems, such as the functional relationships between metareasoning and reasoning [Horvitz and Breese, 1990; Zilberstein and Russell, 1996], the structure of the problem space [Horvitz and Klein, 1995], and the structure of utility [Horvitz, 2001]. In other work, [Hansen and Zilberstein, 2001] proposed a non-myopic dynamic programming solution for single-shot problems. Finally, several planners rely on a heuristic form of online metareasoning when maximizing policy reward under computational constraints in real-world time with no “conversion rate” between the two [Kolobov et al., 2012; Keller and Geißer, 2015]. In contrast, our metareasoning model is unconstrained, with computational and base-MDP costs in the same “currency.”

Our investigation also has connections to research on allocating time in a system composed of multiple sensing and planning components [Zilberstein and Russell, 1996; Zilberstein and Russell, 1993], on optimizing portfolios of planning strategies in scheduling applications [Dean et al., 1995], and on choosing actions to explore in Monte Carlo planning [Hay et al., 2012]. In other related work, [Chanel et al., 2014] consider how best to plan on one thread, while a separate thread processes execution.


3 Preliminaries 

A key contribution of our work is formalizing the metareasoning problem for planning under uncertainty. We build on the framework of stochastic shortest path (SSP) MDPs with a known start state. This general MDP class includes finite-horizon and discounted-reward MDPs as special cases [Bertsekas and Tsitsiklis, 1996], and can also be used to approximate partially observable MDPs with a fixed initial belief state. An SSP MDP $M$ is a tuple $\langle S, A, T, C, s_0, s_g \rangle$, where $S$ is a finite set of states, $A$ is a set of actions that the agent can take, $T : S \times A \times S \to [0,1]$ is a transition function, $C : S \times A \to \mathbb{R}$ is a cost function, $s_0 \in S$ is the start state, and $s_g$ is the goal state. An SSP MDP must have a complete proper policy, i.e., a policy that leads to the goal from any state with probability 1, and all improper policies must accumulate infinite cost from every state from which they fail to reach the goal with a positive probability. The objective is to find a Markovian policy $\pi_{s_0} : S \to A$ with the minimum expected cost of reaching the goal from the start state $s_0$; in SSP MDPs, at least one policy of this form is globally optimal.

Without loss of generality, we assume an SSP MDP to have a specially designated NOP (“no-operation”) action. NOP is an action the agent chooses when it wants to “idle” and “think/plan”, and its semantic meaning is problem-dependent. For example, in some MDPs, choosing NOP means staying in the current state for one time step, while in others it may mean allowing a tidal wave to carry the agent to another state. Designating an action as NOP does not change SSP MDPs’ mathematical properties, but plays a crucial role in our metareasoning formalization.
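As a concrete illustration of the definitions above, the following minimal sketch encodes a two-state SSP MDP with a designated NOP action and computes its expected cost-to-goal by repeated Bellman backups. The class name `SSP`, the dictionary layout, and the toy costs and probabilities are assumptions made for this example only; they are not taken from the paper.

```python
from dataclasses import dataclass

NOP = "NOP"  # hypothetical label for the designated no-operation action


@dataclass
class SSP:
    states: list
    actions: list
    T: dict      # T[(s, a)] -> {next_state: probability}
    C: dict      # C[(s, a)] -> immediate cost
    s0: object
    goal: object


def value_iteration(m, sweeps=200):
    """Expected cost-to-goal via in-place Bellman backups (goal value stays 0)."""
    V = {s: 0.0 for s in m.states}
    for _ in range(sweeps):
        for s in m.states:
            if s == m.goal:
                continue
            V[s] = min(
                m.C[(s, a)] + sum(p * V[t] for t, p in m.T[(s, a)].items())
                for a in m.actions
            )
    return V


# Toy problem (assumed): "go" reaches the goal with probability 0.5;
# NOP idles in place and still costs one unit per step.
toy = SSP(
    states=[0, 1],
    actions=["go", NOP],
    T={(0, "go"): {1: 0.5, 0: 0.5}, (0, NOP): {0: 1.0},
       (1, "go"): {1: 1.0}, (1, NOP): {1: 1.0}},
    C={(0, "go"): 1.0, (0, NOP): 1.0, (1, "go"): 0.0, (1, NOP): 0.0},
    s0=0,
    goal=1,
)
V = value_iteration(toy)
```

In this toy problem the optimal policy never idles, and the expected cost from the start state converges to 2 (the fixed point of v = 1 + 0.5 v).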

4 Formalization and Theoretical Analysis of 
Metareasoning for MDPs 

The online planning problem of an agent, which involves choosing an action to execute in any given state, is represented as an SSP MDP that encapsulates the dynamics of the environment and costs of acting and thinking. We call this problem the base problem. The agent starts off in this environment with some default policy, which can be as simple as random or guided by an unsophisticated heuristic. The agent’s metareasoning problem, then, amounts to deciding, at every step during its interaction with the environment, between improving its existing policy or using this policy’s recommended action while paying a cost for executing either of these options, so as to minimize its expected cost of getting to the goal.

Besides the agent’s state in the base MDP, which we call the world state, the agent’s metareasoning decisions are conditioned on the algorithm the agent uses for solving the base problem, i.e., intuitively, on the agent’s thinking process. To abstract away the specifics of this planning algorithm for the purposes of metareasoning formalization, we view it as a black-box MDP solver and represent it, following the Church-Turing thesis, with a Turing machine $B$ that takes a base SSP MDP $M$ as input. In our analysis, we assume the following about Turing machine $B$’s operation:

• $B$ is deterministic and halts on every valid base MDP $M$. This assumption does not affect the expressiveness of our model, since randomized Turing machines can be trivially simulated on deterministic ones, e.g., via seed enumeration (although potentially at an exponential increase in time complexity). At the same time, it greatly simplifies our theorems.

• An agent’s thinking cycle corresponds to $B$ executing a single instruction.

• A configuration of $B$ is a combination of $B$’s tape contents, state register contents, head position, and next input symbol. It represents the state of the online planner in solving the base problem $M$. We denote the set of all configurations $B$ ever enters on a given input MDP $M$ as $X^{B(M)}$. We assume that $B$ can be paused after executing $y$ instructions, and that its configuration at that point can be mapped to an action for any world state $s$ of $M$ using a special function $f : S \times X^{B(M)} \to A$ in time polynomial in $M$’s flat representation. The number of instructions needed to compute $f$ is not counted into $y$. That is, an agent can stop thinking at any point and obtain a policy for its current world state.

• An agent is allowed to “think” (i.e., execute $B$’s instructions) only by choosing the NOP action. If an agent decides to resume thinking after pausing $B$ and executing a few actions, $B$ restarts from the configuration in which it was last paused.

We can now define metareasoning precisely: 

Definition 1. Metareasoning Problem. Consider an MDP $M = \langle S, A, T, C, s_0, s_g \rangle$ and an SSP MDP solver represented by a deterministic Turing machine $B$. Let $X^{B(M)}$ be the set of all configurations $B$ enters on input $M$, and let $T^{B(M)} : X^{B(M)} \times X^{B(M)} \to \{0,1\}$ be the (deterministic) transition function of $B$ on $X^{B(M)}$. A metareasoning problem for $M$ with respect to $B$, denoted $\mathrm{Meta}_B(M)$, is an MDP $\langle S^m, A^m, T^m, C^m, s_0^m, s_g^m \rangle$ s.t.

• $S^m = S \times X^{B(M)}$

• $A^m = A$

• $T^m((s,\chi), a, (s',\chi')) = \begin{cases} T(s,a,s') & \text{if } a \neq \mathrm{NOP},\ \chi = \chi', \text{ and } a = f(s,\chi) \\ T(s,a,s') \cdot T^{B(M)}(\chi,\chi') & \text{if } a = \mathrm{NOP} \\ 0 & \text{otherwise} \end{cases}$

• $C^m((s,\chi), a, (s',\chi')) = C(s,a,s')$ if $T(s,a,s') \neq 0$, and 0 otherwise

• $s_0^m = (s_0, \chi_0)$, where $\chi_0$ is the first configuration $B$ enters on input $M$

• $s_g^m = (s_g, \chi)$, where $\chi$ is any configuration in $X^{B(M)}$

Solving the metareasoning problem means finding a policy for $\mathrm{Meta}_B(M)$ with the lowest expected cost of reaching $s_g^m$.

This definition casts a metareasoning problem for a base MDP as another MDP (a meta-MDP). Note that in $\mathrm{Meta}_B(M)$, an agent must choose either NOP or an action currently recommended by $B(M)$; in other cases, the transition probability is 0. Thus, $\mathrm{Meta}_B(M)$’s definition essentially forces an agent to switch between two “meta-actions”: thinking or acting in accordance with the current policy.
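The three cases of the meta-MDP transition function can be written out directly. The sketch below is only an illustration: the base transition table `T`, the solver step function `T_B`, the recommendation function `f`, and the two-configuration toy solver are all hypothetical stand-ins chosen for this example.

```python
NOP = "NOP"  # hypothetical label for the no-operation action


def meta_transition(T, T_B, f, s, x, a, s2, x2):
    """Probability of moving from meta-state (s, x) to (s2, x2) under action a."""
    base = T.get((s, a), {}).get(s2, 0.0)
    if a == NOP:
        # Thinking: the world evolves under NOP while the solver advances one step.
        return base * T_B(x, x2)
    if x == x2 and a == f(s, x):
        # Acting: only the currently recommended action is allowed,
        # and the solver configuration stays frozen.
        return base
    return 0.0


# Toy base problem (assumed): from "A", "right" reaches the goal "G".
T = {("A", "left"): {"A": 1.0}, ("A", "right"): {"G": 1.0}, ("A", NOP): {"A": 1.0}}


def T_B(x, x2):
    # Toy deterministic solver with configurations 0 -> 1 (halts at 1).
    return 1.0 if x2 == min(x + 1, 1) else 0.0


def f(s, x):
    # Before thinking the solver recommends "left"; afterwards "right".
    return "left" if x == 0 else "right"


p_not_recommended = meta_transition(T, T_B, f, "A", 0, "right", "G", 0)
p_think = meta_transition(T, T_B, f, "A", 0, NOP, "A", 1)
p_act = meta_transition(T, T_B, f, "A", 1, "right", "G", 1)
```

Here the agent cannot take "right" until one thinking step has changed the solver's recommendation, which is exactly the restriction to the two meta-actions described above.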

Modeling an agent’s reasoning process with a Turing machine allows us to see that at every time step the metareasoning decision depends on the combination of the current world state and the agent’s “state of mind,” as captured by the Turing machine’s current configuration. In principle, this decision could depend on the entire history of the two, but the following theorem implies that, as for $M$, at least one optimal policy for $\mathrm{Meta}_B(M)$ is always Markovian.

Theorem 1. If the base MDP $M$ is an SSP MDP, then $\mathrm{Meta}_B(M)$ is an SSP MDP as well, provided that $B$ halts on $M$ with a proper policy. If the base MDP $M$ is an infinite-horizon discounted-reward MDP, then so is $\mathrm{Meta}_B(M)$. If the base MDP $M$ is a finite-horizon MDP, then so is $\mathrm{Meta}_B(M)$.


Proof. Verifying the result for finite-horizon and infinite-horizon discounted-reward MDPs $M$ is trivial, since the only requirement $\mathrm{Meta}_B(M)$ must satisfy in these cases is to have a finite horizon or a discount factor, respectively.

If $M$ is an SSP MDP, then, per the SSP MDP definition [Bertsekas and Tsitsiklis, 1996], to ascertain the theorem’s claim we need to verify that (1) $\mathrm{Meta}_B(M)$ has at least one proper policy and (2) every improper policy in $\mathrm{Meta}_B(M)$ accumulates an infinite cost from some state.

To see why (1) is true, recall that $\mathrm{Meta}_B(M)$’s state space pairs the world states of $M$ with the configurations Turing machine $B$ enters on $M$. Consider any state $(s'_0, \chi'_0)$ of $\mathrm{Meta}_B(M)$. Since $B$ is deterministic, as assumed above, the configuration $\chi'_0$ lies in the linear sequence of configurations between the “designated” initial configuration $\chi_0$ and the final proper-policy configuration that $B$ enters according to the theorem. Thus, $B$ can reach a proper-policy configuration from $\chi'_0$. Therefore, let the agent starting in the state $(s'_0, \chi'_0)$ of $\mathrm{Meta}_B(M)$ choose NOP until $B$ halts, and then follow the proper policy corresponding to $B$’s final configuration until it reaches a goal state $s_g$ of $M$. This state corresponds to a goal state $(s_g, \chi)$ of $\mathrm{Meta}_B(M)$. Since this construct works for any $(s'_0, \chi'_0)$, it gives a complete proper policy for $\mathrm{Meta}_B(M)$.

To verify (2), consider any policy $\pi^m$ for $\mathrm{Meta}_B(M)$ that with a positive probability fails to reach the goal. Any infinite trajectory of $\pi^m$ that fails to reach the goal can be mapped onto a trajectory in $M$ that repeats the action choices of $\pi^m$’s trajectory in $M$’s state space $S$. Since $M$ is an SSP MDP, this projected trajectory must accumulate an infinite cost, and therefore the original trajectory in $\mathrm{Meta}_B(M)$ must do so as well, implying the desired result.

□ 

We now present two results to address the difficulty of 
metareasoning. 

Theorem 2. For an SSP MDP $M$ and a deterministic Turing machine $B$ representing a solver for $M$, the time complexity of $\mathrm{Meta}_B(M)$ is at most polynomial in the time complexity of executing $B$ on $M$.

Proof. The main idea is to construct the MDP representing $\mathrm{Meta}_B(M)$ by simulating $B$ on $M$. Namely, we can run $B$ on $M$ until it halts and record every configuration $B$ enters to obtain the set $X$. Given $X$, we can construct $S^m = S \times X$ and all other components of $\mathrm{Meta}_B(M)$ in time polynomial in $|X|$ and $|M|$. Constructing $X$ itself takes time proportional to the running time of $B$ on $M$. Since, by Theorem 1, $\mathrm{Meta}_B(M)$ is an SSP MDP and hence can be solved in time polynomial in the size of its components, e.g., by linear programming, the result follows.

□ 
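The construction used in the proof of Theorem 2 can be sketched in a few lines: run the solver to completion while recording every configuration, then pair the recorded configurations with world states. The callbacks `step` and `is_halted`, and the integer "configurations" in the toy usage, are hypothetical placeholders for a real deterministic solver.

```python
def enumerate_configurations(initial, step, is_halted):
    """Simulate a deterministic solver and record each configuration it enters."""
    configs = [initial]
    while not is_halted(configs[-1]):
        configs.append(step(configs[-1]))
    return configs


def meta_states(states, configs):
    """Meta-MDP state space: every (world state, solver configuration) pair."""
    return [(s, x) for s in states for x in configs]


# Toy solver (assumed): configurations are counters 0, 1, 2, 3; it halts at 3.
configs = enumerate_configurations(0, lambda k: k + 1, lambda k: k >= 3)
S_m = meta_states(["a", "b"], configs)
```

The sketch also makes the practical objection visible: enumerating the configurations already requires running the solver until it halts.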

Theorem 3. Metareasoning for SSP MDPs is P-complete 
under NC-reduction. (Please see the appendix for proof.) 

At first glance, the results above look encouraging. However, upon closer inspection they reveal several subtleties making optimal metareasoning utterly impractical. First, although both SSP MDPs and their metareasoning counterparts with respect to an optimal polynomial-time solver are in P, doing metareasoning for a given MDP $M$ is appreciably more expensive than solving that MDP itself. Given that the additional complexity due to metareasoning cannot be ignored, the agent now faces the new challenge of allocating computational time between metareasoning and planning for the base problem. This challenge is a meta-metareasoning problem, and ultimately causes infinite regress, an unbounded nested sequence of ever-costlier reasoning problems.

Second, constructing $\mathrm{Meta}_B(M)$ by running $B$ on $M$, as the proof of Theorem 2 proceeds, may entail solving $M$ in the process of metareasoning. While the proof doesn’t show that this is the only way of constructing $\mathrm{Meta}_B(M)$, without making additional assumptions about $B$’s operation one cannot exclude the possibility of having to run $B$ until convergence and thereby completely solving $M$ even before $\mathrm{Meta}_B(M)$ is fully formulated. Such a construction would defeat the purpose of metareasoning.

Third, the validity of Theorems 2 and 3 relies on an implicit crucial assumption that the transitions of solver $B$ on the base MDP $M$ are known in advance. Without this knowledge, $\mathrm{Meta}_B(M)$ turns into a reinforcement learning problem [Sutton and Barto, 1998], which further increases the complexity of metareasoning and the need for simulating $B$ on $M$. Neither of these is viable in reality.

The difficulties with optimal metareasoning motivate the development of approximation procedures. In this regard, the preceding analysis provides two important insights. It suggests that, since running $B$ on $M$ until halting is infeasible, it may be worth trying to predict $B$’s progress on $M$. Many existing MDP algorithms have clear operational patterns, e.g., evaluating policies in the decreasing order of their cost, as policy iteration does [Howard, 1960]. Regularities like these can be of value in forecasting the benefit of running $B$ on $M$ for additional cycles of thinking. We now focus on exploring approximation schemes that can leverage these patterns.


5 Algorithms for Approximate Metareasoning 

Our approach to metareasoning is guided by value of computation (VOC) analysis. In contrast to previous work that formulates VOC for single actions or decision-making problems [Horvitz, 1987; Horvitz et al., 1989; Russell and Wefald, 1991], we aim to formulate VOC for online planning. For a given metareasoning problem $\mathrm{Meta}_B(M)$, VOC at any encountered state $s^m = (s, \chi)$ is exactly the difference between the Q-value of the agent following $f(s, \chi)$ (the action recommended by the current policy of the base MDP $M$) and the Q-value of the agent taking NOP and thinking:

$$\mathrm{VOC}(s^m) = Q^*(s^m, f(s, \chi)) - Q^*(s^m, \mathrm{NOP}). \quad (1)$$


VOC captures the difference in long-term utility between thinking and acting as determined by these Q-values. An agent should take the NOP action and think when the VOC is positive. Our technique aims to evaluate VOC by estimating $Q^*(s^m, f(s, \chi))$ and $Q^*(s^m, \mathrm{NOP})$. However, attempting to estimate these terms in a near-optimal manner ultimately runs into the same difficulties as solving $\mathrm{Meta}_B(M)$, such as simulating the agent’s thinking process many steps into the future, and is likely infeasible. Therefore, fast approximations for the Q-values will generally have to rely on simplifying assumptions. We rely on performing greedy metareasoning analysis, as has been done in past studies of metareasoning [Horvitz et al., 1989; Russell and Wefald, 1991]:

Meta-Myopic Assumption. In any state $s^m$ of the meta-MDP, we assume that after the current step, the agent will never again choose NOP, and hence will never change its policy.

This meta-myopic assumption is important in allowing us 
to reduce VOC estimation to predicting the improvement in 
the value of the base MDP policy following a single thinking 
step. The weakness of this assumption is that opportunities 
for subsequent policy improvements are overlooked. In other 
words, the VOC computation only reasons about the current 
thinking opportunity. Nonetheless, in practice, we compute 
VOC at every timestep, so the agent can still think later. Our 
experiments show that our algorithms perform well in spite 
of their meta-myopicity. 


5.1 Implementing Metareasoning with BRTDP 

We begin the presentation of our approximation scheme with the selection of $B$, the agent’s thinking algorithm. Since approximating $Q^*(s^m, f(s, \chi))$ and $Q^*(s^m, \mathrm{NOP})$ essentially amounts to assessing policy values, we would like an online planning algorithm that provides efficient policy value approximations, preferably with some guarantees. Having access to these policy value approximations enables us to design approximate metareasoning algorithms that can evaluate VOC efficiently in a domain-independent fashion.

One algorithm with this property is Bounded RTDP (BRTDP) [McMahan et al., 2005]. It is an anytime planning algorithm based on RTDP [Barto et al., 1995]. Like RTDP, BRTDP maintains a lower bound on an MDP’s optimal value function $V^*$, which is repeatedly updated via Bellman backups as BRTDP simulates trials/rollouts to the goal, making BRTDP’s configuration-to-configuration transition function $T^{B(M)}(\chi, \chi')$ stochastic. A key difference is that in addition to maintaining a lower bound, it also maintains an upper bound, updated in the same conceptual way as the lower one. If BRTDP is initialized with a monotone upper-bound heuristic, then the upper bound decreases monotonically as BRTDP runs. The construction of domain-independent monotone bounds is beyond the scope of this paper, but is easy for the domains we study in our experiments. Another key difference between BRTDP and RTDP is that if BRTDP is stopped before convergence, it returns an action greedy with respect to the upper, not lower bound. This behavior guarantees that the expected cost of a policy returned at any time by a monotonically-initialized BRTDP is no worse than BRTDP’s current upper bound. Our metareasoning algorithms utilize these properties to estimate VOC. In the rest of the discussion, we assume that BRTDP is initialized with a monotone upper-bound heuristic.
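To make the role of the two bounds concrete, here is a minimal sketch of a single Bellman backup applied to an upper and a lower bound at one state. It is not a full BRTDP implementation (there is no trial sampling or convergence test), and the function name `backup`, the dictionary layout, and the toy numbers are assumptions for illustration.

```python
def backup(bound, T, C, s, actions):
    """One Bellman backup of a value bound at state s.

    Returns the backed-up value and the action that is greedy w.r.t. the bound.
    """
    q = {
        a: C[(s, a)] + sum(p * bound[t] for t, p in T[(s, a)].items())
        for a in actions
    }
    best = min(q, key=q.get)
    return q[best], best


# Toy problem (assumed): state 0, goal state 1, unit action costs.
T = {(0, "go"): {1: 0.5, 0: 0.5}, (0, "slow"): {1: 0.1, 0: 0.9}}
C = {(0, "go"): 1.0, (0, "slow"): 1.0}
upper = {0: 10.0, 1: 0.0}   # monotone upper-bound heuristic (assumed)
lower = {0: 0.0, 1: 0.0}    # zero heuristic as the lower bound

new_upper, greedy_action = backup(upper, T, C, 0, ["go", "slow"])
new_lower, _ = backup(lower, T, C, 0, ["go", "slow"])
```

After one backup the bounds tighten from [0, 10] to [1, 6] around the true optimal cost of 2, and the action returned is the one that is greedy with respect to the upper bound.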


5.2 Approximating VOC 

We now show how BRTDP’s properties help us with estimating the two terms in the definition of VOC, $Q^*(s^m, f(s, \chi))$ and $Q^*(s^m, \mathrm{NOP})$. We first assume that one “thinking cycle” of BRTDP (i.e., executing NOP once and running BRTDP in the meantime, resulting in a transition from BRTDP’s current configuration $\chi$ to another configuration $\chi'$) corresponds to completing some fixed number of BRTDP trials from the agent’s current world state $s$.

Estimating $Q^*(s^m, \mathrm{NOP})$

We first describe how to estimate the value of taking the NOP action (thinking). At the highest level, this estimation first involves writing down an expression for $Q^*(s^m, \mathrm{NOP})$, making a series of approximations for different terms, and then modeling how BRTDP’s upper bounds on the Q-value function drop in order to compute the needed quantities.

When opting to think by choosing NOP, the agent may transition to a different world state while simultaneously updating its policy for the base problem. Therefore, we can express

$$Q^*(s^m, \mathrm{NOP}) = \sum_{s'} T(s, \mathrm{NOP}, s') \sum_{\chi'} T^{B(M)}(\chi, \chi')\, V^*((s', \chi')). \quad (2)$$

Because of meta-myopicity, we have $V^*((s', \chi')) = V^{\chi'}(s')$, where $V^{\chi'}$ is the value function of the policy corresponding to $\chi'$ in the base MDP. However, this expression cannot be efficiently evaluated in practice, since we know neither BRTDP’s transition distribution $T^{B(M)}(\chi, \chi')$ nor the state values $V^{\chi'}(s')$, forcing us to make further approximations. To do so, we assume $V^{\chi'}$ and $Q^{\chi'}$ are random variables, and rewrite

$$\sum_{\chi'} T^{B(M)}(\chi, \chi')\, V^{\chi'}(s') = \sum_{a} P(A^{\chi'}_{s'} = a)\, E[Q^{\chi'}(s', a) \mid A^{\chi'}_{s'} = a], \quad (3)$$

where the random variable $A^{\chi'}_{s'}$ takes value $a$ iff $f(s', \chi') = a$ after one thinking cycle in state $(s, \chi)$. Intuitively, $P(A^{\chi'}_{s'} = a)$ denotes the probability that BRTDP will recommend action $a$ in state $s'$ after one thinking cycle. Now, let us denote the Q-value upper bound corresponding to BRTDP’s current configuration $\chi$ as $\bar{Q}^{\chi}$. This value is known. Then, let the upper bound corresponding to BRTDP’s next configuration $\chi'$ be $\bar{Q}^{\chi'}$. Because we do not know $\chi'$, this value is unknown, and is a random variable. Because BRTDP selects actions greedily w.r.t. the upper bound, we follow this behavior and use the upper bound to estimate the Q-value by assuming that $Q^{\chi'} = \bar{Q}^{\chi'}$. Since the value of $\bar{Q}^{\chi'}$ is unknown at the time of the VOC computation, $P(A^{\chi'}_{s'} = a)$ and $E[\bar{Q}^{\chi'}(s', a) \mid A^{\chi'}_{s'} = a]$ are computed by integrating over the possible values of $\bar{Q}^{\chi'}$. We have that

$$E[\bar{Q}^{\chi'}(s', a) \mid A^{\chi'}_{s'} = a] = \frac{\int_{\bar{Q}^{\chi'}(s',a)} \bar{Q}^{\chi'}(s', a)\, P(A^{\chi'}_{s'} = a \mid \bar{Q}^{\chi'}(s', a))\, P(\bar{Q}^{\chi'}(s', a))}{P(A^{\chi'}_{s'} = a)}$$

and

$$P(A^{\chi'}_{s'} = a) = \int_{\bar{Q}^{\chi'}(s',a)} P(\bar{Q}^{\chi'}(s', a)) \prod_{a_i \neq a} P(\bar{Q}^{\chi'}(s', a_i) > \bar{Q}^{\chi'}(s', a)).$$

Therefore, we must model the distribution that $\bar{Q}^{\chi'}$ is drawn from. We do so by modeling the change $\Delta\bar{Q} = \bar{Q}^{\chi} - \bar{Q}^{\chi'}$ due to a single BRTDP thinking cycle that corresponds to a transition from configuration $\chi$ to $\chi'$. Since $\bar{Q}^{\chi}$ is known and fixed, estimating a distribution over possible $\Delta\bar{Q}$ gives us a distribution over $\bar{Q}^{\chi'}$.

Let $\hat{\Delta}\bar{Q}_{s,a}$ be the change in $\bar{Q}_{s,a}$ resulting from the most recent thinking cycle for some state $s$ and action $a$. We first assume that the change resulting from an additional cycle of thinking, $\Delta\bar{Q}_{s,a}$, will be no larger than the last change, $\Delta\bar{Q}_{s,a} \le \hat{\Delta}\bar{Q}_{s,a}$. This assumption is reasonable, because we can expect the change in bounds to decrease as BRTDP converges to the true value function. Given this assumption, we must choose a distribution $D$ over the interval $[0, \hat{\Delta}\bar{Q}_{s,a}]$ such that for the next thinking cycle, $\Delta\bar{Q}_{s,a} \sim D$. Figure 1a illustrates these modeling assumptions for two hypothetical actions, $a_1$ and $a_2$.

One option is to make $D$ uniform, so as to represent our poor knowledge about the next bound change. Then, computing $P(A^{\chi'}_{s'} = a)$ involves evaluating an integral of a polynomial of degree $O(|A|)$ (the product of $|A| - 1$ CDFs and 1 PDF), and computing $E[\bar{Q}^{\chi'}(s', a) \mid A^{\chi'}_{s'} = a]$ also entails evaluating an integral of degree $O(|A|)$, so these quantities can be computed for all actions in a state in time $O(|A|^2)$. Since the overall goal of this subsection, approximating $Q^*(s^m, \mathrm{NOP})$, requires computing $P(A^{\chi'}_{s'} = a)$ for all actions in all states where NOP may lead, assuming there are no more than $K \ll |A|$ such states, the complexity becomes $O(K|A|^2)$ for each state visited by the agent on its way to the goal.
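The two quantities under the independent uniform model can be approximated numerically. The sketch below uses a midpoint-rule quadrature rather than the closed-form polynomial integration described in the text; this substitution, the function names, and the toy bounds are assumptions made for illustration. Each action's next upper bound is taken to be uniform on [current bound minus last drop, current bound].

```python
def survival(q, hi, drop):
    """P(Q' > q) when Q' is uniform on [hi - drop, hi] (point mass if drop == 0)."""
    lo = hi - drop
    if q < lo:
        return 1.0
    if q >= hi:
        return 0.0
    return (hi - q) / drop


def recommend_stats(upper, last_drop, n=2000):
    """For each action a, return (P(A = a), E[Q'(a) | A = a]) under independence."""
    stats = {}
    for a in upper:
        hi, d = upper[a], last_drop[a]
        prob = 0.0
        mean_num = 0.0
        for k in range(n):
            q = hi - d * (k + 0.5) / n   # midpoint of the k-th slice of a's range
            w = 1.0 / n                  # pdf * slice width = (1/d) * (d/n)
            for b in upper:
                if b != a:
                    # a is recommended only if every other bound ends up higher.
                    w *= survival(q, upper[b], last_drop[b])
            prob += w
            mean_num += q * w
        stats[a] = (prob, mean_num / prob if prob > 0 else None)
    return stats


# Two symmetric hypothetical actions: each should be recommended half the time.
stats = recommend_stats({"a1": 10.0, "a2": 10.0}, {"a1": 4.0, "a2": 4.0})
```

For this symmetric case the conditional mean is the expected minimum of two independent uniforms on [6, 10], i.e. 22/3. The nested loops over actions mirror the quadratic dependence on |A| noted above.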

A weakness of this approach is that the changes in the upper bounds for different actions are modeled independently. For example, if the upper bounds for two actions in a given state decreased by a large amount in the previous thinking step, then it is unlikely that in the next thinking step one of them will drop dramatically while the other drops very little. This independence can cause the amount of uncertainty in the upper bound at the next thinking step to be overestimated, leading to VOC being overestimated as well.

Therefore, we create another version of the algorithm assuming that the speeds of decrease in Q-value upper bounds for all actions are perfectly correlated; all ratios between future drops in the next thinking cycle are equal to the ratios between the observed drops in the last thinking cycle. Formally, for a given state $s$, we let $\rho \sim \mathrm{Uniform}[0, 1]$. Then, let $\Delta\bar{Q}_{s,a} = \rho \cdot \hat{\Delta}\bar{Q}_{s,a}$ for all actions $a$.

Now, to compute $P(A^{\chi'}_{s'} = a)$, for each action $a$, we represent the range of its possible future Q-values $\bar{Q}^{\chi'}_{s,a}$ with a line segment $l_a$ on the unit interval $[0, 1]$, where $l_a(0) = \bar{Q}^{\chi}_{s,a}$ and $l_a(1) = \bar{Q}^{\chi}_{s,a} - \hat{\Delta}\bar{Q}_{s,a}$. Then, $P(A^{\chi'}_{s'} = a)$ is simply the proportion of $l_a$ which lies below all the other lines representing all other actions. We can naively compute these probabilities in time $O(|A|^2)$ by enumerating all intersections. Similarly, computing $E[\bar{Q}^{\chi'}(s', a) \mid A^{\chi'}_{s'} = a]$ is also easy. This value is the mean of the portion of $l_a$ that is beneath all other lines. Figure 1b illustrates these computations.
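Under the perfectly correlated model, the computation reduces to finding, for each action, the sub-interval of [0, 1] on which its line is the lowest. The following sketch does this by pairwise comparison of lines; the function name and the example bounds are hypothetical, and ties between identical lines are simply given probability zero.

```python
def correlated_stats(upper, last_drop):
    """For each action a, return (P(A = a), E[Q'(a) | A = a]) when every bound
    drops by rho * last_drop[a] for a shared rho ~ Uniform[0, 1]."""
    stats = {}
    for a in upper:
        lo, hi, empty = 0.0, 1.0, False
        for b in upper:
            if b == a:
                continue
            # l_a(rho) < l_b(rho)  <=>  rho * (drop_b - drop_a) < upper_b - upper_a
            dq = upper[b] - upper[a]
            dd = last_drop[b] - last_drop[a]
            if dd > 0:
                hi = min(hi, dq / dd)
            elif dd < 0:
                lo = max(lo, dq / dd)
            elif dq <= 0:
                empty = True   # parallel lines and a is never strictly lower
        prob = 0.0 if empty else max(0.0, hi - lo)
        # The line is linear in rho, so its mean over [lo, hi] is its midpoint value.
        mean = upper[a] - last_drop[a] * (lo + hi) / 2 if prob > 0 else None
        stats[a] = (prob, mean)
    return stats


# Hypothetical bounds: a2 starts lower, but a1 drops faster and overtakes it.
stats = correlated_stats({"a1": 10.0, "a2": 8.0}, {"a1": 6.0, "a2": 2.0})
```

In this example the two lines cross at rho = 0.5, so each action is recommended with probability 0.5, with conditional means 5.5 for a1 and 7.5 for a2.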

Whether or not we make the assumption of action independence, we further speed up the computations by only calculating $E[\bar{Q}^{\chi'}(s', a) \mid A^{\chi'}_{s'} = a]$ and $P(A^{\chi'}_{s'} = a)$ for the two “most promising” actions $a$, those with the lowest expectation of potential upper bounds. This limits the computation time to the time required to determine these actions (linear in $|A|$), and makes the time complexity of estimating $Q^*(s^m, \mathrm{NOP})$ for one state $s$ be $O(K|A|)$ instead of $O(K|A|^2)$.

Figure 1: a) Hypothetical drops in upper bounds on the Q-values of two actions, $a_1$ and $a_2$. We assume the next Q-value drop resulting from another cycle of thinking, $\Delta\bar{Q}$, is drawn from a range equal to the last drop from thinking, $\hat{\Delta}\bar{Q}$. b) Assuming perfect correlation in the speed of decrease in the Q-value upper bounds, as the upper bounds of the two actions drop from an additional cycle of thinking, initially $a_2$ has a better upper bound, but eventually $a_1$ overtakes $a_2$.

Estimating $Q^*(s^m, f(s, \chi))$

Now that we have described how to estimate the value of taking the NOP action, we describe how to estimate the value of taking the currently recommended action, $f(s, \chi)$. We estimate $Q^*(s^m, f(s, \chi))$ by computing $E[\bar{Q}^{\chi'}(s, f(s, \chi))]$, which takes constant time, keeping the overall time complexity linear. The reason we estimate $Q^*(s^m, f(s, \chi))$ using future Q-value upper bound estimates based on a probabilistic projection of $\chi'$, as opposed to our current Q-value upper bounds based on the current configuration, is to make use of the more informed bounds derived at the future utility estimation. As the BRTDP algorithm is given more computation time, it can more accurately estimate the upper bound of a policy. This type of approximation has been justified before [Russell and Wefald, 1991]. In addition, using future utility estimates in both estimating $Q^*(s^m, f(s, \chi))$ and $Q^*(s^m, \mathrm{NOP})$ provides a consistency guarantee: if thinking leads to no policy change, then our method estimates VOC to be zero.
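Putting the two estimates side by side, a VOC estimate can be sketched as follows. The expected future bound of the recommended action is taken as its current upper bound minus half of its last drop, which is the mean under a uniform drop model. The function and parameter names are hypothetical, and `nop_cost` is an extra assumption that does not appear in Equation (2); it defaults to zero.

```python
def estimate_voc(expected_next_value, nop_successors, upper_rec, drop_rec, nop_cost=0.0):
    """Estimated VOC = (estimated cost of acting now) - (estimated cost of thinking).

    expected_next_value[s2]: sum over actions of P(A = a) * E[Q'(s2, a) | A = a]
    nop_successors:          {s2: T(s, NOP, s2)}
    upper_rec, drop_rec:     current upper bound and last observed drop for the
                             currently recommended action in the current state
    """
    q_think = nop_cost + sum(
        p * expected_next_value[s2] for s2, p in nop_successors.items()
    )
    q_act = upper_rec - drop_rec / 2.0   # mean of a uniform drop on [0, drop_rec]
    return q_act - q_think


# Hypothetical numbers: NOP keeps the agent in state "s". The recommended action
# has upper bound 8 and last drop 2, while thinking is expected to yield 6.5.
voc_think = estimate_voc({"s": 6.5}, {"s": 1.0}, 8.0, 2.0)

# If thinking cannot change the recommendation, the expected value after
# thinking equals the acting estimate and the estimated VOC is zero.
voc_same = estimate_voc({"s": 7.0}, {"s": 1.0}, 8.0, 2.0)
```

The second call illustrates the consistency property: when thinking leads to no policy change, both estimates coincide and VOC is zero.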

5.3 Putting It All Together 

The core of our algorithms involves the computations we
have described, performed in every state $s$ the agent visits on the
way to the goal. In the experiments, we use UnCorr
Metareasoner to denote the metareasoner that assumes the actions
are uncorrelated, and Metareasoner to denote the metareasoner
that does not make this assumption. To complete the
algorithms, we ensure that they decide the agent should think
for another cycle if $\Delta Q_{s,a}$ isn’t yet available for the agent’s
current world state $s$ (e.g., because BRTDP has never updated
bounds for this state’s Q-value so far), since VOC computation
is not possible without prior observations on $\Delta Q_{s,a}$.
Crucially, all our estimates make metareasoning take time
only linear in the number of actions, $O(K|A|)$, per visited
state.
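
The per-state decision rule described above can be sketched as follows. The names `choose_meta_action` and `delta_history`, and the rule of thinking exactly when the estimated VOC is positive, are illustrative assumptions rather than details taken from the paper.

```python
from typing import Callable, Dict, Hashable, List


def choose_meta_action(
    state: Hashable,
    delta_history: Dict[Hashable, List[float]],
    estimate_voc: Callable[[Hashable], float],
) -> str:
    """Return "NOP" (think for another cycle) or "ACT" (execute the current policy)."""
    # Without recorded bound changes for this state, VOC cannot be estimated,
    # so the agent thinks for another cycle.
    if not delta_history.get(state):
        return "NOP"
    # Assumption: think exactly when the estimated VOC is positive.
    return "NOP" if estimate_voc(state) > 0 else "ACT"
```

The fallback branch is what makes the agent think in states that BRTDP has not yet updated.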

6 Experiments 

We evaluate our metareasoning algorithms in several synthetic
domains designed to reflect a wide variety of factors
that could influence the value of metareasoning. Our goal is
to demonstrate the ability of our algorithms to estimate the
value of computation and adapt to a plethora of world conditions.
The experiments are performed on four domains, all of
which are built on a 100 × 100 grid world, where the agent can
move between cells at each time step to get to the goal located
in the upper right corner. To initialize the lower and upper
bounds of BRTDP, we use the zero heuristic and an appropriately
scaled (multiplied by a constant) Manhattan distance to
the goal, respectively.
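
A minimal sketch of this bound initialization is given below. The scaling constant is not stated in the text, so `scale` is a hypothetical parameter, and the `(x, y)` cell convention is likewise an assumption.

```python
from typing import Tuple

Cell = Tuple[int, int]  # assumed (x, y) grid coordinates


def initial_bounds(cell: Cell, goal: Cell, scale: float) -> Tuple[float, float]:
    """Return (lower, upper) initial bounds on the cost to reach the goal."""
    manhattan = abs(goal[0] - cell[0]) + abs(goal[1] - cell[1])
    lower = 0.0                # zero heuristic
    upper = scale * manhattan  # Manhattan distance multiplied by a constant
    return lower, upper
```

For the upper bound to be valid, `scale` would have to be large enough that the scaled distance never underestimates the true expected cost.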

6.1 Domains 

The four world domains are as follows: 

• Stochastic. This domain adds winds to the grid world
to be analogous to worlds with stochastic state transitions.
Moving against the wind causes slower movement
across the grid, whereas moving with the wind results in
faster movement. The agent’s initial state is the southeast
corner and the goal is located in the northeast corner.
We set the parameters of the domain as follows so
that there is a policy that can get the agent to the goal
with a small number of steps (in tens instead of hundreds)
and where the winds significantly influence the
number of steps needed to get to the goal: The agent
can move 11 cells at a time and the wind has a pushing
power of 10 cells. The next location of the agent is
determined by adding the agent’s vector and the wind’s
vector except when the agent decides to think (executes
NOP), in which case it stays in the same position. Thus,
the winds can never push the agent in the opposite direction
of its intention. The prevailing wind direction over
most of the grid is northerly, except for the column of
cells containing the goal and starting position, where it is
southerly. Note that this southerly wind direction makes
the initial heuristic extremely suboptimal. To simulate
stochastic state transitions, the winds have their prevailing
direction in a given cell with 60% probability; with
40% probability they have a direction orthogonal to the
prevailing one (20% easterly and 20% westerly).

We perform a set of experiments on this simplest domain
of the set, to observe the effect of different costs
for thinking and acting on the behaviors of algorithms.
We vary the cost of thinking and acting between 1 and
15. When we vary the cost of thinking, we fix the cost
of acting at 11, and when we vary the cost of acting, we
fix the cost of thinking at 1.

• Traps. This domain modifies the Stochastic domain to
resemble the setting where costs for thinking and acting
are not constant among states. To simplify the parameter
choices, we fix the costs of acting and thinking
to be equal, respectively, to the agent’s moving distance
and the wind strength. Thus, the cost of thinking is 10 and
the cost of acting is 11. To vary the costs of thinking
and acting between states, we make thinking and acting
at the initial state extremely expensive at a cost of
100, about 10 times the cost of acting and thinking in
the other states. Thus, the agent is forced to think outside
its initial state in order to perform optimally.

• DynamicNOP-1. In the previous domains, executing a
NOP does not change the agent’s state. In this domain,
thinking causes the agent to move in the direction of the
wind, so the agent transitions stochastically as a
result of thinking. The cost of thinking
is therefore composed of both explicit and implicit components:
a static value of 1 unit and a dynamic component determined
by stochastic state transitions as a result of thinking.
The static value is set to 1 so that the dynamic component
can dominate the decisions about thinking. The
agent starts in cell (98,1). We change the wind directions
so that there are easterly winds in the southernmost
row and northerly winds in the easternmost column that can
push the agent very quickly to the goal. Westerly winds
exist everywhere else, pushing the agent away from the
goal. We change the stochasticity of the winds so that
the westerly winds change to northerly winds with 20%
probability, and all other wind directions are no longer
stochastic. We lower the amount of stochasticity to better
see if our agents can reason about the implicit costs
of thinking. The wind directions are arranged so that
there is potential for the agent to improve upon its initial
policy but thinking is risky, as it can move the agent to
the left region, which is hard to recover from since all
the winds push the agent away from the goal.

• DynamicNOP-2. This domain is just like the previous
domain, but we change the direction of the winds in the
northernmost row to be easterly. These winds also do
not change directions. In this domain, as compared to
the previous one, it is less risky to take a thinking action;
even when the agent is pushed to the left region of
the board, the agent can find strategies to get to the goal
quickly by utilizing the easterly wind at the top region
of the board.
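
The transition rule of the Stochastic domain (agent vector plus wind vector, with a static NOP) can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulator: cells are `(x, y)` pairs with `y` increasing to the north, wind names denote the direction in which the wind pushes, and the agent is clamped at the grid boundary. The DynamicNOP variants, where NOP also moves the agent, are not modeled.

```python
import random
from typing import Dict, Optional, Tuple

Vec = Tuple[int, int]
DIRS: Dict[str, Vec] = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
ORTHOGONAL = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}
SIZE, AGENT_STEP, WIND_PUSH = 100, 11, 10


def sample_wind(prevailing: str, rng: random.Random) -> str:
    """Prevailing direction with probability 0.6, each orthogonal one with 0.2."""
    first, second = ORTHOGONAL[prevailing]
    return rng.choices([prevailing, first, second], weights=[0.6, 0.2, 0.2])[0]


def step(pos: Vec, move: Optional[str], wind: str) -> Vec:
    """Next cell for a move in direction `move`; None means NOP (thinking)."""
    if move is None:
        return pos  # static NOP: the agent stays in place
    dx = AGENT_STEP * DIRS[move][0] + WIND_PUSH * DIRS[wind][0]
    dy = AGENT_STEP * DIRS[move][1] + WIND_PUSH * DIRS[wind][1]
    # Assumption: the grid boundary stops the agent.
    x = max(0, min(SIZE - 1, pos[0] + dx))
    y = max(0, min(SIZE - 1, pos[1] + dy))
    return x, y
```

Because the agent moves 11 cells and the wind pushes 10, moving directly against the wind still yields a net advance of one cell, which matches the statement that the wind never reverses the agent's intended direction.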

6.2 The Metareasoning Gap 

We introduce the concept of the metareasoning gap as a
way to quantify the potential improvement over the initial
heuristic-implied policy, denoted as Heuristic, that is
possible due to optimal metareasoning. The metareasoning
gap is the ratio of the expected cost of Heuristic for the
base MDP to the expected cost of the optimal metareasoning
policy, computed at the initial state. Exact computation of the
metareasoning gap requires evaluating the optimal metareasoning
policy and is infeasible. Instead, we compute an upper
bound on the metareasoning gap by substituting the cost of
the optimal metareasoning policy with the cost of the optimal
policy for the base MDP (denoted OptimalBase). The
metareasoning gap can be no larger than this upper bound,
because metareasoning can only add cost to OptimalBase.
We quantify each domain with this upper bound ($MG_{UB}$)
in Table 1 and show that our algorithms for metareasoning
provide significant benefits when $MG_{UB}$ is high. We note
that none of the algorithms use the metareasoning gap in their
reasoning.



Domain                   Heuristic   OptimalBase   $MG_{UB}$
Stochastic (Thinking)    1089        103.9         10.5
Stochastic (Acting)      767.3       68.1          11.3
Traps                    979         113.5         8.6
DynamicNOP-1             251.4       66            3.8
DynamicNOP-2             119.4       66            1.8


Table 1: Upper bounds of metareasoning gaps ($MG_{UB}$) for
all test domains, defined as the ratio of the expected cost of
the initial heuristic policy (Heuristic) to that of an optimal
one (OptimalBase) at the initial state.
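
As a quick check of the definition, the upper bounds in Table 1 can be recomputed from the two cost columns. The snippet below is only a worked illustration; the variable and function names are ours.

```python
# Expected costs at the initial state, copied from Table 1:
# domain -> (Heuristic, OptimalBase)
table = {
    "Stochastic (Thinking)": (1089.0, 103.9),
    "Stochastic (Acting)": (767.3, 68.1),
    "Traps": (979.0, 113.5),
    "DynamicNOP-1": (251.4, 66.0),
    "DynamicNOP-2": (119.4, 66.0),
}


def mg_upper_bound(heuristic_cost: float, optimal_base_cost: float) -> float:
    """Ratio of the heuristic policy's cost to the optimal base-MDP cost."""
    return heuristic_cost / optimal_base_cost


# Rounded to one decimal place, as reported in the table.
mg_ub = {name: round(mg_upper_bound(h, o), 1) for name, (h, o) in table.items()}
```

Rounded to one decimal place, the ratios reproduce the last column of the table.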

6.3 Experimental Setup 

We compare our metareasoning algorithms against a number
of baselines. The Think*Act baseline simply plans for n
cycles at the initial state and then executes the resulting policy,
without planning again. We also consider the Prob baseline,
which chooses to plan with probability p at each state,
and executes its current policy with probability 1 − p. An important
drawback of these baselines is that their performance
is sensitive to their parameters n and p, and the optimal parameter
settings vary across domains. The NoInfoThink
baseline plans for another cycle if it does not have information
about how the BRTDP upper bounds will change. This
baseline is a simplified version of our algorithms that does
not try to estimate the VOC.
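
The two parameterized baselines amount to simple decision rules, sketched below. The function names and the string labels "NOP" and "ACT" are hypothetical; only the rules themselves come from the description above.

```python
import random


def think_then_act(cycles_done: int, n: int, at_initial_state: bool) -> str:
    """Think*Act: plan for n cycles at the initial state, then only act."""
    return "NOP" if at_initial_state and cycles_done < n else "ACT"


def prob_baseline(p: float, rng: random.Random) -> str:
    """Prob: plan with probability p, otherwise execute the current policy."""
    return "NOP" if rng.random() < p else "ACT"
```

Neither rule looks at the planner's bounds, which is why their performance depends entirely on how well n and p happen to suit the domain.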

For each experimental condition, we run each metareasoning
algorithm until it reaches the goal 1000 times and average
the results to account for stochasticity. Each BRTDP trajectory
is 50 actions long.

6.4 Results 

In Stochastic, we perform several experiments by varying the
costs of thinking (NOP) and acting. We observe (figures can
be found in the appendix) that when the cost of thinking
is low or when the cost of acting is high, the baselines do
well with high values of n and p, and when the costs are reversed,
smaller values do better. This trend is expected, since
lower thinking cost affords more thinking, but these baselines
don’t allow for predicting the specific “successful” n and p
values in advance. Metareasoner does not require parameter
tuning and beats even the best performing baseline
for all settings. Figure 2a compares the metareasoning algorithms
against the baselines when the results are averaged
over the various settings of the cost of acting, and Figure 2b
shows results averaged over the various settings of the cost of
thinking. Metareasoner does extremely well in this domain
because the metareasoning gap is large, suggesting that
metareasoning can improve the initial policy significantly.
Importantly, we see that Metareasoner performs better
than NoInfoThink, which shows the benefit from reasoning
about how the bounds on Q-values will change. UnCorr
Metareasoner does not do as well as Metareasoner,
probably because the assumption that actions’ Q-values are
uncorrelated does not hold well.

[Figure 2 plots. Recoverable panel titles: c) Traps, d) DynamicNOP-1, e) DynamicNOP-2; x-axis: Number of Thinking Cycles n for Think*Act Baseline.]

Figure 2: Comparison of Metareasoner and UnCorr Metareasoner with baselines on experimental domains. For readability, some
figures do not include Heuristic when it performs especially poorly.

We now turn to Traps, where thinking and acting in the initial
state incur significant cost. Figure 2c again summarizes
the results. Think*Act performs very poorly, because it is
limited to thinking only at the initial state. Metareasoner
does well, because it figures out that it should not think
in the initial state (beyond the initial thinking step), and
should instead quickly move to safer locations. UnCorr
Metareasoner also closes the metareasoning gap significantly,
but again not as much as Metareasoner.

We now consider DynamicNOP-1, a domain adversarial
to approximate metareasoning, because winds almost everywhere
push the agent away from the goal. There are only
a few locations from which winds can carry the agent to
the destination. Figure 2d shows that our algorithms do not
achieve large gains. However, this result is not surprising.
The best policy involves little thinking, because whenever
the agent chooses to think, it is pushed away from the goal,
and thinking for just a few consecutive time steps can take
the agent to states where reaching the goal is extremely difficult.
Consequently, Think*Act with 1-3 thinking steps
turns out to be near-optimal, since it is pushed away from the
goal only slightly and can use a slightly improved heuristic
to head back. Metareasoner actually does well in many
individual runs, but occasionally thinks longer due to VOC
computation stochasticity and can get stuck, yielding higher
average policy cost. In particular, it may frequently be pushed
into a state that it has never encountered before, where it must
think again because it does not have any history about how
BRTDP’s bounds have changed in that state, and then subsequently
get pushed into an unencountered state again. In this
domain, our approximate algorithms can diverge from
an optimal policy, which would plan very little to minimize
the risk of being pushed away from the goal.

DynamicNOP-2 provides the agent more opportunities
to recover even if it makes a poor decision. Figure 2e
demonstrates that our algorithms perform much better in
DynamicNOP-2 than in DynamicNOP-1. In DynamicNOP-2,
even if our algorithms do not discover, from initial thinking,
the jetstreams that can push the agent towards the goal, they are
provided more chances to recover when they get stuck. When
thinking can move the agent on the board, having more opportunities
to recover reduces the risk associated with making
suboptimal thinking decisions. Interestingly, the metareasoning
gap is decreased at the initial state by the addition
of the extra jetstream. However, the metareasoning gap at
many other states in the domain is increased, showing that
the metareasoning gap at the initial state is not an ideal
way to characterize the potential for improvement via metareasoning
in all domains.

7 Conclusion and Future Work 

We formalize and analyze the general metareasoning problem
for MDPs, demonstrating that metareasoning is only polynomially
harder than solving the base MDP. Given the determination
that optimal general metareasoning is impractical, we
turn to approximate metareasoning algorithms, which estimate
the value of computation by relying on bounds given by
BRTDP. Finally, we empirically compare our metareasoning
algorithms to several baselines on problems designed to reflect
challenges posed across a spectrum of worlds, and show
that the proposed algorithms are much better at closing large
metareasoning gaps.

We have assumed that the agent can plan only when it takes
the NOP action. A generalization of our work would allow
varying amounts of thinking as part of any action. Some actions
may consume more CPU resources than others, and actions
which do not consume all resources during execution
can allocate the remainder to planning. We can also relax the
meta-myopic assumption, so as to consider the consequences
of thinking for more than one cycle. In many cases, assuming
that the agent will only think for one more step can lead to
underestimation of the value of thinking, since many cycles of
thinking may be necessary to see significant value. This ability
can be obtained with our current framework by projecting
changes in bounds for multiple steps. However, in experiments
to date, we have found that pushing out the horizon
of analysis was associated with large accumulations of errors
and poor performance due to approximation upon approximation
from predictions about multiple thinking cycles. Finally,
we may be able to improve our metareasoners by learning
about and harnessing more details about the base-level
planner. In our Metareasoner approximation scheme, we
make strong assumptions about how the upper bounds provided
by BRTDP will change, but learning distributions over
these changes may improve performance. More informed
models may lead to accurate estimation of non-myopic value
of computation. However, learning distributions in a domain-independent
manner is difficult, since the planner’s behavior
is heavily dependent on the domain and heuristic at hand.

Figure 3: The construction of a base MDP $M'$ for a metareasoning
problem from an input SSP MDP $M$.

A Appendix 

A.1 Proof of Theorem 3

Proof. By calling metareasoning P-complete we mean that
there exists a Turing machine $B$ s.t. (1) for any input SSP
MDP $M'$, $\mathrm{Meta}_B(M')$ can be decided in time polynomial
in $|M'|$, i.e., $\mathrm{Meta}_B(M)$ is in P, and (2) there is a class of
P-complete problems that can be converted to $\mathrm{Meta}_B(M')$
via an NC-reduction, i.e., by constructing $M'$ appropriately
using a polynomial number of parallel Turing machines, each
operating in polylogarithmic time.

The first part of the above claim follows from Theorem 2:
since SSP MDPs are solvable optimally by linear programming
in polynomial time, $\mathrm{Meta}_B(M)$ is in P if $B$ encodes a
polynomial solver for linear programs.

For the second part, we perform an NC-reduction from the
class of SSP MDPs to the class of SSP MDP-based metareasoning
problems with respect to a fixed optimal polynomial-time
solver $B$. Specifically, given an SSP MDP $M$ with an
initial state, we show how to construct another SSP MDP $M'$
s.t., for the optimal polynomial-time solver $B$ we describe
shortly, deciding $\mathrm{Meta}_B(M')$ is equivalent to deciding $M$.

The intuition behind converting a given SSP MDP $M$ into
$M'$, the SSP MDP that will serve as the base in our metareasoning
problem, is to augment $M$ with new states where the
agent can “think” by using a zero-cost NOP action until the
agent arrives at an optimal policy for the original states of $M$.
Afterwards, the agent can transition from any of these newly
added “thinking states” to $M$’s original start state $s_0$ and execute
the optimal policy from there. Unfortunately, the proof
is not as straightforward as it seems, because we cannot simply
build $M'$ by equipping $M$ with a new start state $s'_0$ with
a self-loop zero-cost NOP action: $M'$ with such an action
would violate the SSP MDP definition. Below, we show how
to overcome this difficulty. Since thinking in the newly added
states of $M'$ costs nothing, the cost of an optimal policy for
$\mathrm{Meta}_B(M')$ is the same as for $M$, so deciding the former
problem decides the latter.

The construction of $M'$ from a given SSP MDP $M$ is illustrated
in Figure 3. Consider the number of instruction-steps
it takes to solve $M$ by linear programming. This number is
polynomial; namely, there exists a polynomial $p_{LP}(|M|)$ that
bounds $M$’s solution time from above. To transform $M$ into
$M'$, we add a set of $p_{LP}(|M|)$ states, $s'_0, s'_1, \ldots, s'_{p_{LP}(|M|)}$, to
$M$. These new states connect into a chain via zero-cost NOP
actions: the start state $s'_0$ of $M'$ links to $s'_1$, $s'_1$ links to $s'_2$, and
so on until $s'_{p_{LP}(|M|)}$ links to $s_0$, the start state of $M$. In addition,
for all original states of $M$, we create a self-loop NOP
action with a positive cost. The entire transformation can be
easily implemented as an NC-reduction on $p_{LP}(|M|) + |S|$
computers, each recording the cost and transition function of
NOP for a separate state. Since for each state, NOP’s cost and
transition functions together can be encoded by just two numbers
(NOP transition function assigns probability 1 to a single
transition that is implicitly but unambiguously determined for
every state), each computer operates in polylogarithmic time.
Moreover, initializing each of the parallel machines with the
MDP state for which it is supposed to write out the transition
and cost function values is as simple as appropriately setting a
pointer to the input tape, and can be done in log-space. Thus,
the above procedure is a valid NC-reduction. Note also that
$M'$ is an SSP MDP: although it has zero-cost actions, they do
not form loops/strongly connected components.
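
The NOP part of this construction can be sketched sequentially in a few lines (the proof's reduction is parallel; this sketch only shows what is written out per state). The function name, the `("chain", i)` labels for the added states, the default self-loop cost of 1.0, and passing the chain length $p_{LP}(|M|)$ as a plain integer are all illustrative assumptions.

```python
from typing import Dict, Hashable, List, Tuple

# state -> (successor under NOP, cost of NOP); NOP is deterministic.
NopTable = Dict[Hashable, Tuple[Hashable, float]]


def add_thinking_chain(states: List[Hashable], start: Hashable, chain_len: int,
                       self_loop_cost: float = 1.0) -> NopTable:
    """Build NOP's successor and cost for every state of M' (hypothetical helper)."""
    if chain_len < 1:
        raise ValueError("the chain needs at least one state")
    nop: NopTable = {}
    # Zero-cost chain: ("chain", 0) -> ("chain", 1) -> ... -> start state of M.
    for i in range(chain_len):
        successor = ("chain", i + 1) if i + 1 < chain_len else start
        nop[("chain", i)] = (successor, 0.0)
    # Every original state of M gets a positive-cost NOP self-loop.
    for s in states:
        nop[s] = (s, self_loop_cost)
    return nop
```

Each entry holds two numbers' worth of information, a successor and a cost, and the zero-cost edges only ever move forward along the chain, so they cannot form a loop.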

Our motivation for constructing $M'$ as above was to provide
an agent with enough states where it can “think” to guarantee
that if the agent starts at $s'_0$, it arrives at $M$’s initial state
$s_0$ with a computed optimal policy from $s_0$ onwards. This
would imply that the expected cost of an optimal policy for
$\mathrm{Meta}_B(M')$ from $s_0$ would be the same as for $M$. However,
for this guarantee to hold, we need a general SSP MDP
solver $B$ that can solve/decide $M'$ in time $O(\mathrm{poly}(|M|))$, not
$O(\mathrm{poly}(|M'|))$. The difference between $O(\mathrm{poly}(|M'|))$ and
$O(\mathrm{poly}(|M|))$ is very important, because $M'$ is larger than
$M$, so the newly added chain of states may not be enough for
an $O(\mathrm{poly}(|M'|))$ policy computation to have zero cost.

To circumvent this issue, we define $B$ that recognizes
“lollypop-shaped” MDPs $M'$ as in Figure 3, which have an
arbitrarily connected subset $S_c$ of the state space representing
a sub-MDP $M_c$ preceded by a chain of NOP-connected states
of size $p_{LP}(|M_c|)$ leading to $M_c$’s start state $s_0$, and ignores
the linear chain part. (Note that the policy for the linear chain
part is determined uniquely and there is no need to write it out
explicitly.) For that, we assume the metareasoning problem’s
input SSP MDP to be in the form of a string

M_c_description###chain_description

In this string, M_c_description stands for the arbitrarily connected
part of the input MDP, and chain_description stands for
the description of the linear NOP-connected chain. For MDPs
violating conditions in Figure 3 (i.e., having a different connectivity
structure or having the linear part of the wrong size),
chain_description must be empty, with the entire MDP description
placed before “###”. $B$ is defined to read that input
string only up to “###” and solve that part using LP.
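
The way $B$ reads its input can be pictured with the short sketch below. The name `run_solver_b` and the `solve_lp` callback are placeholders for illustration; the LP solver is passed in rather than implemented.

```python
from typing import Callable, TypeVar

Policy = TypeVar("Policy")
SEPARATOR = "###"


def run_solver_b(encoded_mdp: str, solve_lp: Callable[[str], Policy]) -> Policy:
    """Solve only the part of the input that precedes the separator."""
    # Everything after the separator (the chain description) is never read.
    core_description, _sep, _chain = encoded_mdp.partition(SEPARATOR)
    return solve_lp(core_description)
```

Since the chain description is ignored, the running time of this reading step depends only on the part before the separator, which is the property the proof relies on.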

Constructing $M'$ from $M$ and recording $M'$ in the aforementioned
way ensures that the optimal policy for the metareasoning
problem $\mathrm{Meta}_B(M')$ chooses NOP until the agent
reaches $M$’s start state $s_0$, by which point the agent will have
computed an optimal policy for $M$. Coupled with the fact
that $\mathrm{Meta}_B(M')$ is in P, this implies the theorem’s claim.

□ 

A.2 More Figures 

Figures 4 through 11 show results for the Stochastic domain
where we vary the cost of thinking and the cost of acting.

Figure 4: Comparison of algorithms in Stochastic, with the
cost of thinking = 1 and cost of acting = 11

Figure 5: Comparison of algorithms in Stochastic, with the
cost of thinking = 5 and cost of acting = 11

Figure 6: Comparison of algorithms in Stochastic, with the
cost of thinking = 10 and cost of acting = 11

Figure 7: Comparison of algorithms in Stochastic, with the
cost of thinking = 15 and cost of acting = 11

Figure 8: Comparison of algorithms in Stochastic, with the
cost of acting = 1 and cost of thinking = 1

Figure 9: Comparison of algorithms in Stochastic, with the
cost of acting = 5 and cost of thinking = 1

Figure 10: Comparison of algorithms in Stochastic, with the
cost of acting = 10 and cost of thinking = 1

Figure 11: Comparison of algorithms in Stochastic, with the
cost of acting = 15 and cost of thinking = 1






















